Microscopy Cell Segmentation via Adversarial Neural Networks
We present a novel method for cell segmentation in microscopy images, inspired
by the Generative Adversarial Network (GAN) approach. Our framework is built on
a pair of competing artificial neural networks with a unique architecture,
termed Rib Cage, which are trained simultaneously and together define a min-max
game resulting in an accurate segmentation of a given image. Our approach has
two main strengths: as in the GAN setting, the method does not require an
explicit formulation of a loss function for the optimization process, and this
in turn allows training on a limited amount of annotated data in a weakly
supervised manner. Promising segmentation results on real fluorescent
microscopy data are presented. The code is freely available at:
https://github.com/arbellea/DeepCellSeg.git
Comment: Accepted to IEEE International Symposium on Biomedical Imaging (ISBI)
201
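Below is a minimal, illustrative PyTorch sketch of the kind of adversarial
segmentation training described above: a segmentation network and a
discriminator are trained against each other, with no hand-crafted pixel-wise
loss. The Rib Cage architecture from the paper is not reproduced; the SegNet
and Disc networks, layer sizes, and learning rates are placeholder assumptions.

import torch
import torch.nn as nn

class SegNet(nn.Module):
    """Stand-in segmentation network: image -> per-pixel foreground probability."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 1, 3, padding=1), nn.Sigmoid(),
        )
    def forward(self, x):
        return self.net(x)

class Disc(nn.Module):
    """Stand-in discriminator: (image, mask) pair -> real/fake logit."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(2, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, 1),
        )
    def forward(self, img, mask):
        return self.net(torch.cat([img, mask], dim=1))

seg, disc = SegNet(), Disc()
opt_s = torch.optim.Adam(seg.parameters(), lr=1e-4)
opt_d = torch.optim.Adam(disc.parameters(), lr=1e-4)
bce = nn.BCEWithLogitsLoss()

def train_step(img, gt_mask):
    # Discriminator: tell real (image, annotation) pairs from (image, prediction) pairs.
    pred = seg(img).detach()
    d_loss = bce(disc(img, gt_mask), torch.ones(img.size(0), 1)) + \
             bce(disc(img, pred), torch.zeros(img.size(0), 1))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()
    # Segmentation network: produce masks the discriminator accepts as real.
    pred = seg(img)
    s_loss = bce(disc(img, pred), torch.ones(img.size(0), 1))
    opt_s.zero_grad(); s_loss.backward(); opt_s.step()
    return d_loss.item(), s_loss.item()

# Toy usage with random tensors standing in for a microscopy image and its annotation:
img, gt = torch.rand(2, 1, 64, 64), (torch.rand(2, 1, 64, 64) > 0.5).float()
train_step(img, gt)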
PromptonomyViT: Multi-Task Prompt Learning Improves Video Transformers using Synthetic Scene Data
Action recognition models have achieved impressive results by incorporating
scene-level annotations, such as objects, their relations, 3D structure, and
more. However, scene structure annotations for videos require significant
effort to gather and label, making these methods expensive to train. In
contrast, synthetic datasets generated by graphics
engines provide powerful alternatives for generating scene-level annotations
across multiple tasks. In this work, we propose an approach to leverage
synthetic scene data for improving video understanding. We present a multi-task
prompt learning approach for video transformers, where a shared video
transformer backbone is enhanced by a small set of specialized parameters for
each task. Specifically, we add a set of ``task prompts'', each corresponding
to a different task, and let each prompt predict task-related annotations. This
design allows the model to capture information shared among synthetic scene
tasks as well as information shared between synthetic scene tasks and a real
video downstream task throughout the entire network. We refer to this approach
as ``Promptonomy'', since the prompts model a task-related structure. We
propose the PromptonomyViT model (PViT), a video transformer that incorporates
various types of scene-level information from synthetic data using the
``Promptonomy'' approach. PViT shows strong performance improvements on
multiple video understanding tasks and datasets.
Comment: Tech report
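As a rough illustration of the ``task prompts'' idea, the PyTorch sketch below
prepends one learnable prompt token per synthetic-scene task to the input token
sequence of a shared transformer, and reads each task's prediction off its own
prompt position. The task names, dimensions, and heads are invented for the
example and are not taken from the paper.

import torch
import torch.nn as nn

class PromptedTransformer(nn.Module):
    def __init__(self, dim=256, depth=4, heads=4,
                 tasks=("depth", "objects", "relations"), task_out_dims=(1, 10, 5)):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=depth)  # shared backbone
        self.tasks = list(tasks)
        # One learnable prompt token per task, shared across all inputs.
        self.prompts = nn.ParameterDict(
            {t: nn.Parameter(torch.randn(1, 1, dim)) for t in tasks})
        self.heads = nn.ModuleDict(
            {t: nn.Linear(dim, d) for t, d in zip(tasks, task_out_dims)})

    def forward(self, tokens):
        b = tokens.size(0)
        prompt_toks = torch.cat(
            [self.prompts[t].expand(b, -1, -1) for t in self.tasks], dim=1)
        x = self.backbone(torch.cat([prompt_toks, tokens], dim=1))
        # Each task's annotation is predicted from its own prompt's output token.
        return {t: self.heads[t](x[:, i]) for i, t in enumerate(self.tasks)}

# Toy usage: 32 tokens of width 256 standing in for embedded video patches/tubes.
model = PromptedTransformer()
preds = model(torch.randn(2, 32, 256))
print({k: v.shape for k, v in preds.items()})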
Teaching Structured Vision&Language Concepts to Vision&Language Models
Vision and Language (VL) models have demonstrated remarkable zero-shot
performance in a variety of tasks. However, some aspects of complex language
understanding remain a challenge. We introduce the collective notion of
Structured Vision&Language Concepts (SVLC), which includes object attributes,
relations, and states that are present in the text and visible in the image.
Recent studies have shown that even the best VL models struggle with SVLC. A
possible way of fixing this issue is by collecting dedicated datasets for
teaching each SVLC type, yet this might be expensive and time-consuming.
Instead, we propose a more elegant data-driven approach for enhancing VL
models' understanding of SVLCs that makes more effective use of existing VL
pre-training datasets and does not require any additional data. While automatic
understanding of image structure remains largely unsolved, language
structure is much better modeled and understood, allowing for its effective
utilization in teaching VL models. In this paper, we propose various techniques
based on language structure understanding that can be used to manipulate the
textual part of off-the-shelf paired VL datasets. VL models trained with the
updated data exhibit a significant improvement of up to 15% in their SVLC
understanding, with only a mild degradation in their zero-shot capabilities,
whether training from scratch or fine-tuning a pre-trained model.
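As a toy example of manipulating the textual side of an existing image-text
pair, the Python sketch below applies a single rule that swaps one color
attribute in a caption, producing a hard negative text for the same image. The
word list and the rule itself are illustrative stand-ins for the richer
language-structure-based manipulations the abstract refers to.

import random
from typing import Optional

COLORS = ["red", "blue", "green", "yellow", "black", "white"]

def make_attribute_negative(caption: str) -> Optional[str]:
    """Swap one color word for a different one so the text no longer matches the
    image (a hard negative); return None if the rule does not apply."""
    words = caption.split()
    for i, w in enumerate(words):
        if w.lower() in COLORS:
            words[i] = random.choice([c for c in COLORS if c != w.lower()])
            return " ".join(words)
    return None

caption = "a red car parked next to a white fence"
print(caption, "->", make_attribute_negative(caption))
# During VL training, the image would be pulled towards its original caption and
# pushed away from such generated negatives, e.g. via an extra contrastive term.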
Dense and Aligned Captions (DAC) Promote Compositional Reasoning in VL Models
Vision and Language (VL) models offer an effective method for aligning
representation spaces of images and text, leading to numerous applications such
as cross-modal retrieval, visual question answering, captioning, and more.
However, the aligned image-text spaces learned by all the popular VL models
still suffer from the so-called `object bias': their representations behave as
`bags of nouns', mostly ignoring or downplaying the attributes, relations, and
states of objects described in the texts and appearing in the images. Although
notable attempts at fixing these `compositional reasoning' issues have been
proposed in the recent literature, the problem is still far from solved. In
this paper,
we uncover two factors limiting the VL models' compositional reasoning
performance. These two factors are properties of the paired VL dataset used for
fine-tuning and pre-training the VL model: (i) the caption quality, or in other
words the `image-alignment', of the texts; and (ii) the `density' of the
captions, in the sense of mentioning all the details appearing in the image. We
propose a fine-tuning approach that automatically treats these factors,
leveraging a standard VL dataset (CC3M). Applied to CLIP, it yields significant
gains in compositional reasoning performance over both the base model and the
strongest baseline.
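The PyTorch sketch below illustrates one simple way to fine-tune with several
``dense'' caption pieces per image: their text embeddings are pooled and a
standard CLIP-style symmetric contrastive loss is applied. The toy linear
encoders, mean pooling, and temperature are assumptions made for the example,
not the paper's actual captioning and training pipeline.

import torch
import torch.nn as nn
import torch.nn.functional as F

image_encoder = nn.Linear(512, 256)  # stand-in for a pretrained image encoder
text_encoder = nn.Linear(300, 256)   # stand-in for a pretrained text encoder

def dense_caption_loss(image_feats, caption_feats):
    """image_feats: [B, 512]; caption_feats: [B, K, 300], K dense captions per image."""
    img = F.normalize(image_encoder(image_feats), dim=-1)               # [B, D]
    txt = F.normalize(text_encoder(caption_feats), dim=-1).mean(dim=1)  # pool K captions
    txt = F.normalize(txt, dim=-1)
    logits = img @ txt.t() / 0.07                                       # temperature 0.07
    labels = torch.arange(img.size(0))
    # Symmetric image-to-text and text-to-image contrastive loss.
    return (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels)) / 2

# Toy usage: a batch of 4 images, each paired with 3 dense caption embeddings.
loss = dense_caption_loss(torch.randn(4, 512), torch.randn(4, 3, 300))
loss.backward()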
Incorporating Structured Representations into Pretrained Vision & Language Models Using Scene Graphs
Vision and Language (VL) models have demonstrated remarkable zero-shot
performance in a variety of tasks. However, recent studies have shown that even
the best VL models struggle to capture aspects of scene understanding, such as
object attributes, relationships, and action states. In contrast, obtaining
structured annotations, e.g., scene graphs (SGs), that could improve these
models is time-consuming, costly, and tedious, and thus cannot be used on a
large scale. Here we ask: can small datasets containing SG annotations provide
sufficient information for enhancing structured understanding of VL models? We
show that it is indeed possible to improve VL models using such data by
utilizing a specialized model architecture and a new training paradigm. Our
approach captures structure-related information for both the visual and textual
encoders by directly supervising both components when learning from SG labels.
We use scene graph supervision to generate fine-grained captions based on
various graph augmentations highlighting different compositional aspects of the
scene, and to predict SG information using an open vocabulary approach by
adding special ``Adaptive SG tokens'' to the visual encoder. Moreover, we
design a new adaptation technique tailored specifically to the SG tokens that
allows better learning of the graph prediction task while still maintaining
zero-shot capabilities. Our model shows strong performance improvements on the
Winoground and VL-checklist datasets with only a mild degradation in zero-shot
performance.
Comment: Tech Report
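As an illustration of the ``Adaptive SG tokens'' idea, the PyTorch sketch below
appends a few learnable tokens to the patch tokens of a frozen visual encoder
and predicts scene-graph items from their outputs, so graph supervision can be
added while the pretrained weights stay untouched. The backbone, token count,
and fixed-vocabulary head are placeholder assumptions (the paper uses an open
vocabulary approach).

import torch
import torch.nn as nn

class SGTokenAdapter(nn.Module):
    def __init__(self, dim=256, num_sg_tokens=8, sg_vocab=100):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=4)
        for p in self.backbone.parameters():      # keep the pretrained encoder frozen
            p.requires_grad_(False)
        self.sg_tokens = nn.Parameter(torch.randn(1, num_sg_tokens, dim))  # trainable
        self.sg_head = nn.Linear(dim, sg_vocab)   # predicts SG items from SG tokens

    def forward(self, patch_tokens):
        b, n = patch_tokens.size(0), patch_tokens.size(1)
        x = torch.cat([patch_tokens, self.sg_tokens.expand(b, -1, -1)], dim=1)
        x = self.backbone(x)
        img_tokens, sg_out = x[:, :n], x[:, n:]
        return img_tokens, self.sg_head(sg_out)   # image features + SG predictions

# Toy usage: 49 patch tokens (a 7x7 grid) of width 256.
model = SGTokenAdapter()
img_tokens, sg_logits = model(torch.randn(2, 49, 256))
print(img_tokens.shape, sg_logits.shape)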